https://doi.org/10.1038/s42256-025-01011-z

Liu, Shengchao; Li, Yanjing; Li, Zhuoxinran; Gitter, Anthony; Zhu, Yutao; Lu, Jiarui; Xu, Zhao; Nie, Weili; Ramanathan, Arvind; Xiao, Chaowei; et al. (March 2025, Nature Machine Intelligence)

Current AI-assisted protein design relies mainly on protein sequence and structure information. Meanwhile, humans have curated a vast body of textual knowledge describing proteins’ high-level functionalities, yet whether incorporating such text data helps in protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multimodal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441,000 text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) the best hit ratio on 12 zero-shot text-guided protein editing tasks; and (3) superior performance on four out of six protein property prediction benchmarks.
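
The abstract does not say which objective ProteinCLAP uses to align the text and protein representations. As an illustration only, the sketch below shows one common way to align two modalities: a generic symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. The function names, the temperature value, and the assumption that row i of each list forms a matched pair are all hypothetical and are not taken from the paper. Embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(text_embs, protein_embs, temperature=0.1):
    """Generic symmetric contrastive loss (an assumption, not the paper's loss).

    Row i of text_embs is assumed to be paired with row i of protein_embs;
    every other row in the batch serves as a negative.
    """
    t = [l2_normalize(v) for v in text_embs]
    p = [l2_normalize(v) for v in protein_embs]
    n = len(t)
    # Temperature-scaled cosine similarity between every text and every protein.
    logits = [
        [sum(a * b for a, b in zip(t[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text -> protein direction: each text should pick out its own protein.
    loss_t2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Protein -> text direction: the same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_p2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2p + loss_p2t)

# Toy 2-D embeddings: matched pairs point in the same direction.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
# Same embeddings with the pairing swapped, so each text matches the wrong protein.
swapped_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                                  [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the correctly paired batch yields a much smaller loss than the swapped one, which is the behavior an alignment objective of this kind is meant to reward.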

Free, publicly accessible full text available March 27, 2026
